Introduction

This report is an exploratory data analysis of weather data from SeaTac International Airport in Seattle, Washington. The dataset spans 67 years of records on temperature, precipitation, and weather events such as storms and hail. The goal of this analysis is to explore the relationships between weather variables in order to find interesting and surprising correlations. Set against the global context of climate change, the analysis also looks for evidence of local effects.

## [1] 63 15
##  [1] "Year"                "AvgTemp"             "MaxTemp"            
##  [4] "MinTemp"             "YearlyPrecipitation" "AvgWind"            
##  [7] "DaysRain"            "DaysSnow"            "DaysStorm"          
## [10] "DaysFog"             "DaysTornado"         "DaysHail"           
## [13] "DaysPrecipitation"   "PrecipitationPerDay" "AvgTemp.bucket"
## 'data.frame':    63 obs. of  15 variables:
##  $ Year               : int  1948 1949 1950 1951 1952 1953 1954 1955 1956 1957 ...
##  $ AvgTemp            : num  9.4 9.9 9.4 9.9 9.9 10.3 9.6 8.7 9.7 10.2 ...
##  $ MaxTemp            : num  14.7 15.9 14.7 15.7 15.7 15.4 14.6 13.7 15 15.4 ...
##  $ MinTemp            : num  5.3 5.4 5.1 5.2 5.4 6.3 5.6 4.9 5.7 6.4 ...
##  $ YearlyPrecipitation: num  1170 845 1408 1033 611 ...
##  $ AvgWind            : num  14.3 14 16.7 16.7 15.8 15.2 17.5 19 19.7 19 ...
##  $ DaysRain           : int  212 175 218 163 157 227 214 210 177 189 ...
##  $ DaysSnow           : int  22 42 36 35 26 10 24 39 33 21 ...
##  $ DaysStorm          : int  20 20 10 8 4 10 6 5 4 4 ...
##  $ DaysFog            : int  159 137 153 144 151 150 142 162 146 153 ...
##  $ DaysTornado        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ DaysHail           : int  4 2 1 0 2 1 0 5 0 2 ...
##  $ DaysPrecipitation  : int  234 217 254 198 183 237 238 249 210 210 ...
##  $ PrecipitationPerDay: num  5 3.89 5.54 5.21 3.34 ...
##  $ AvgTemp.bucket     : Factor w/ 4 levels "(8.69,10.4]",..: 1 1 1 1 1 1 1 1 1 1 ...
##       Year         AvgTemp         MaxTemp         MinTemp     
##  Min.   :1948   Min.   : 8.70   Min.   :13.70   Min.   :4.900  
##  1st Qu.:1964   1st Qu.:10.40   1st Qu.:15.65   1st Qu.:6.300  
##  Median :1982   Median :11.00   Median :16.10   Median :6.700  
##  Mean   :1981   Mean   :10.87   Mean   :16.15   Mean   :6.643  
##  3rd Qu.:1998   3rd Qu.:11.30   3rd Qu.:16.65   3rd Qu.:7.100  
##  Max.   :2015   Max.   :12.90   Max.   :18.40   Max.   :8.300  
##                                                                
##  YearlyPrecipitation    AvgWind         DaysRain        DaysSnow   
##  Min.   : 611.4      Min.   :10.20   Min.   :144.0   Min.   : 3.0  
##  1st Qu.: 884.8      1st Qu.:12.15   1st Qu.:177.0   1st Qu.: 8.0  
##  Median :1004.4      Median :13.70   Median :190.0   Median :15.0  
##  Mean   :1008.8      Mean   :13.78   Mean   :188.8   Mean   :16.4  
##  3rd Qu.:1125.1      3rd Qu.:14.65   3rd Qu.:204.5   3rd Qu.:23.0  
##  Max.   :1408.2      Max.   :19.70   Max.   :227.0   Max.   :42.0  
##  NA's   :9                                                         
##    DaysStorm         DaysFog       DaysTornado         DaysHail     
##  Min.   : 0.000   Min.   : 12.0   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.: 4.000   1st Qu.: 55.0   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median : 6.000   Median :149.0   Median :0.00000   Median :1.0000  
##  Mean   : 6.619   Mean   :122.2   Mean   :0.01587   Mean   :0.8571  
##  3rd Qu.: 8.000   3rd Qu.:163.0   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :20.000   Max.   :186.0   Max.   :1.00000   Max.   :5.0000  
##                                                                     
##  DaysPrecipitation PrecipitationPerDay     AvgTemp.bucket
##  Min.   :153.0     Min.   :3.341       (8.69,10.4]:17    
##  1st Qu.:189.5     1st Qu.:4.468       (10.4,11]  :17    
##  Median :205.0     Median :4.914       (11,11.3]  :14    
##  Mean   :205.2     Mean   :4.972       (11.3,12.9]:15    
##  3rd Qu.:218.5     3rd Qu.:5.412                         
##  Max.   :254.0     Max.   :7.589                         
##                    NA's   :9

Univariate Plots Section

Average temperature is approximately normally distributed.

Days of rain is skewed left. Mode at 210.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     8.0    15.0    16.4    23.0    42.0

Days of snow is skewed right: the mean (16.4) exceeds the median (15), and the tail extends toward high values. Modes at 8 and 14. Median at 15. Outliers at 39 and 42.

Days of precipitation is roughly normal, but has 3 peaks. It is calculated by adding days of rain and days of snow.

Again, roughly normal.

Skewed right. Some outliers.

Days of fog has a clearly bimodal distribution. What is going on here?

Univariate Analysis

What is the structure of your dataset?

The dataset contains 63 observations of 15 variables (3 of them created). Each observation corresponds to a year of data from SeaTac International Airport in Seattle. The years 2002, 2005, and 2016 were N/A in the original dataset and have been removed.

What is/are the main feature(s) of interest in your dataset?

AvgTemp, or Average Temperature, is the main feature of interest in the dataset.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I will be looking at days of rain and snow, average wind speed, days of storm, hail, and fog, and yearly precipitation to explore the possible effects of rising temperatures.

Did you create any new variables from existing variables in the dataset?

Yes. I created 3 new variables: DaysPrecipitation, which is the number of days of snow plus the number of days of rain; PrecipitationPerDay, which is the annual precipitation divided by the days of precipitation; and AvgTemp.bucket, which assigns each observation to a quartile of average temperature.
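The derivations can be sketched in a few lines. This is a minimal Python illustration rather than the code behind the report (whose output comes from R). The rows are the first five years as printed by `str` above, where yearly precipitation is rounded, so the per-day values match the printed column only approximately. The bucket edges are read off the `AvgTemp.bucket` factor levels, and the function name `temp_bucket` is an assumption made for this example.

```python
from bisect import bisect_left

# First five years, copied from the str() output (precipitation is rounded there).
rows = [
    {"Year": 1948, "AvgTemp": 9.4, "YearlyPrecipitation": 1170, "DaysRain": 212, "DaysSnow": 22},
    {"Year": 1949, "AvgTemp": 9.9, "YearlyPrecipitation": 845, "DaysRain": 175, "DaysSnow": 42},
    {"Year": 1950, "AvgTemp": 9.4, "YearlyPrecipitation": 1408, "DaysRain": 218, "DaysSnow": 36},
    {"Year": 1951, "AvgTemp": 9.9, "YearlyPrecipitation": 1033, "DaysRain": 163, "DaysSnow": 35},
    {"Year": 1952, "AvgTemp": 9.9, "YearlyPrecipitation": 611, "DaysRain": 157, "DaysSnow": 26},
]

# Upper edges of the first three quartile buckets, read off the factor levels.
BUCKET_EDGES = [10.4, 11.0, 11.3]
BUCKET_LABELS = ["(8.69,10.4]", "(10.4,11]", "(11,11.3]", "(11.3,12.9]"]


def temp_bucket(avg_temp):
    """Return the right-closed bucket label for an average temperature.

    Values outside the observed range fall into the first or last bucket.
    """
    return BUCKET_LABELS[bisect_left(BUCKET_EDGES, avg_temp)]


for row in rows:
    # Days with any precipitation: rain days plus snow days.
    row["DaysPrecipitation"] = row["DaysRain"] + row["DaysSnow"]
    # Average precipitation per day on which precipitation fell.
    row["PrecipitationPerDay"] = row["YearlyPrecipitation"] / row["DaysPrecipitation"]
    row["AvgTemp.bucket"] = temp_bucket(row["AvgTemp"])

for row in rows:
    print(row["Year"], row["DaysPrecipitation"],
          round(row["PrecipitationPerDay"], 2), row["AvgTemp.bucket"])
```

All five of these early years land in the coldest bucket, which agrees with the factor codes shown in the `str` output.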

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of the distributions I examined were approximately normal. Days of fog has an unusual bimodal distribution, which will be addressed in the next section.

Bivariate Plots Section


Key correlations:

Average temperature and average wind = -.514

Average temperature and days of snow = -.708
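For reference, a correlation like these can be computed as a Pearson coefficient. The sketch below is a plain Python illustration, not the report's own code, and it uses only the first ten years of AvgTemp and DaysSnow printed in the `str` output. The result on this small sample will therefore differ from the full-data value of -.708. The function name `pearson_r` is an assumption made for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# First ten years (1948-1957), copied from the str() output.
avg_temp = [9.4, 9.9, 9.4, 9.9, 9.9, 10.3, 9.6, 8.7, 9.7, 10.2]
days_snow = [22, 42, 36, 35, 26, 10, 24, 39, 33, 21]

r = pearson_r(avg_temp, days_snow)
print(round(r, 3))
```

Even on this ten-year sample the coefficient is negative, in line with the direction of the full-data correlation.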

Average yearly temperature has been increasing gradually for the past 67 years.

The number of days of rain per year has been relatively stable except for the past few years.

There has been a dramatic decrease in the number of days of snow per year.

Given the decline of both rain and snow days, it is no surprise that the combination of the two also declines.

Aside from the missing years, it appears that the amount of yearly precipitation has remained the same, although there is a large amount of variability from year to year.

As the yearly precipitation stays the same and the number of snow and rain days decreases, the amount of precipitation per day of precipitation must increase.

There is a clear negative correlation (-.708) between average temperature and days of snow.

Another negative correlation (-.298), this time between average temperature and days of rain.

This is a surprising find. How would average wind speed be affected by temperature?

Both of these are highly negatively correlated with temperature, and as a result are correlated with each other.

Dashed line is y=x. Points below this line represent years with more storms than hail. This is intuitive because most hail is expected to happen within storm conditions. There is one year where hail happened twice and there was only one storm. This is an interesting phenomenon but does not call attention to a larger trend.
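The comparison against the y=x line amounts to checking, year by year, whether hail days exceed storm days. A minimal Python sketch follows, using only the first ten years printed in the `str` output, so the single hail-heavy year mentioned above is not part of this sample. The function name `compare_to_diagonal` is an assumption made for this example.

```python
# First ten years (1948-1957), copied from the str() output.
days_storm = [20, 20, 10, 8, 4, 10, 6, 5, 4, 4]
days_hail = [4, 2, 1, 0, 2, 1, 0, 5, 0, 2]


def compare_to_diagonal(storm, hail):
    """Count years with fewer, equal, or more hail days than storm days."""
    counts = {"hail < storm": 0, "hail == storm": 0, "hail > storm": 0}
    for s, h in zip(storm, hail):
        if h < s:
            counts["hail < storm"] += 1
        elif h == s:
            counts["hail == storm"] += 1
        else:
            counts["hail > storm"] += 1
    return counts


print(compare_to_diagonal(days_storm, days_hail))
```

In this sample, nine years have fewer hail days than storm days and one year (1955, with 5 of each) sits exactly on the line.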

Here’s the source of the bimodality seen in the univariate plot section.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

As the year increases, average temperature increases, with 2014 and 2015 having significant increases when compared to the previous 50 years. Some years, such as 1958 and 1992, spike but drop the next year. As the year increases, the number of days of rain and the number of days of snow decrease. This is contrasted by a consistent annual precipitation, suggesting an increase in the duration/severity of daily precipitation events. This increase is not due to more extreme weather events, as the number of days of stormy weather has remained approximately constant over time.

While examining these trends with regard to average temperature, several strong correlations were noted. The strongest relationship was between average temperature and days of snow, with a correlation of -.708, suggesting a strong decrease in snow days as temperature rises. In contrast, there is only a -.298 relationship between days of rain and temperature. Given the strong correlation between days of rain and snow and the precipitation per day feature, it is not surprising that the correlation is -.403. Another interesting correlation is between average temperature and average wind speed, with a correlation of -.514. This suggests that rising temperatures correspond with a decrease in wind speed.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. Average wind speed and days of snow have a correlation of .483. This makes sense in light of earlier findings that average temperature has a strong relationship to both features, so that years with high average temperatures correspond to fewer snow days and a lower average wind speed. This relationship will be addressed further in the multivariate plots section.

Looking at days of hail vs. days of storm reveals that in all but one year there were more storms than hail events. This is an expected result, as most hail occurs during storm events. Plotting days of fog by year shows a dramatic 5-fold drop-off between 1996 and 1997. This will be explored further in the final plots section.

What was the strongest relationship you found?

The strongest relationship in the data is between the average temperature and days of snow, with a correlation of -.708.

Multivariate Plots Section

Maximum and minimum temperatures follow average temperature very closely. The dashed line, which shows the gap between the maximum and the minimum, stays fairly consistent even as temperatures generally increase.

## sw$AvgTemp.bucket: (8.69,10.4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.00   24.00   25.47   33.00   42.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (10.4,11]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   14.00   16.00   16.35   23.00   25.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11,11.3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    8.00   12.00   13.36   19.00   26.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11.3,12.9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.5     7.0     9.0    10.5    22.0

Days of snow broken up by temperature quartiles. The warmest temperatures occur most recently and have the fewest days of snow. Each successive temperature bucket shows fewer snow days in nearly every summary statistic; the only exception is the maximum, which rises from 25 to 26 between the second and third buckets.

## sw$AvgTemp.bucket: (8.69,10.4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.80   14.60   15.60   15.91   16.80   19.70 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (10.4,11]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.8    11.9    13.2    12.9    13.8    15.2 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11,11.3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   12.37   13.25   13.11   13.88   14.70 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11.3,12.9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.90   11.45   13.10   13.00   13.80   17.90

Average wind speed broken up by temperature quartiles. The separation between buckets is less pronounced than for days of snow. The coldest bucket has a clearly higher median (15.6) than the other three, whose medians are nearly identical (13.1 to 13.25).

Putting the previous graphs together and taking out year provides a different look at the same data.

Same as the previous, divided into separate plots rather than color. Each increase in temperature bucket leads to increasing clustering in the lower left corner.

The more extreme values of precipitation per day are found in the higher temperature buckets. Whether this is a coincidence is unclear.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The first plot in this section compares the average min, average max, and average yearly temperatures. As expected, the minimum and maximum temperature averages closely follow the average temperature, with yearly spikes and drops showing across all three variables. The dashed line, MaxTemp - MinTemp, shows years in which this is not the case. For example, the dashed line drops in 1997, a sign that the average minimum temperature that year increased but the average maximum did not. Aside from these minor deviations, the dashed line shows that despite gains in average temperature over time, the range between maximum and minimum temperature remains about the same.

The next several plots detail the relationship between average wind speed, days of snow, year, and temperature. This is done in 4 plots. The first two plots show days of snow and average wind speed respectively by year, split over average temperature buckets divided into quartiles. These plots show that both average wind speed and days of snow decrease as year and temperature increase. They also show that the warmer the temperature, the more recent the year is likely to be. Plot 3 removes year to emphasize the relationships between average wind, snow days, and temperature. Plot 4 separates the colors of plot 3 into separate plots for clarity.

Were there any interesting or surprising interactions between features?

I would say the inverse relationship between average temperature and average wind speed is surprising. It is not surprising that the number of days of snow decreases as temperature rises, as snow only happens in cold weather. In contrast, it is not intuitive that wind speed should decrease with an increase in temperature. Wind is formed by differences in air pressure that result from differences in temperature. This implies that the decrease in average wind speed is the result of a smaller temperature differential. As such, it is likely that this decrease in wind speed is related not only to local temperatures, but to variable rates of temperature increases in the region following a trend of global warming. I think it is interesting that wind speed might be a local effect of global warming.


Final Plots and Summary

Plot One

Description One

Minimum and maximum temperatures (averaged over the year) closely follow the average temperature trend. The dashed line, Maximum - Minimum, shows some deviation year to year but overall displays that the range of temperatures stays approximately the same, even as the average temperature has increased 2-3 C over 65 years.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.900   9.050   9.400   9.505   9.900  11.100

The median of Maximum - Minimum is 9.4 degrees, with a minimum difference of 7.9 and a maximum of 11.1. The 1st and 3rd quartiles are 9.05 and 9.9 respectively, indicating that the middle 50% of the values fall within a range of less than 1 degree. The histogram illustrates this, with 11.1 and 7.9 being clear outliers. The consistency of this variable is important as it signifies the consistency of variation in temperature year to year. For now at least, the increase in average temperature does not seem to affect this variation. This is important because increased swings in temperature are likely to have a strong impact on wild ecosystems and crop growth.
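The width of that middle 50% is just the interquartile range of Maximum - Minimum. As a rough Python sketch (not the report's own code), the quartiles below are taken from the summary printed above, and the per-year differences use only the first ten years of MaxTemp and MinTemp from the `str` output.

```python
# Quartiles of (MaxTemp - MinTemp), copied from the summary above.
q1, q3 = 9.05, 9.9
iqr = q3 - q1  # width of the middle 50% of yearly differences

# First ten years (1948-1957), copied from the str() output.
max_temp = [14.7, 15.9, 14.7, 15.7, 15.7, 15.4, 14.6, 13.7, 15.0, 15.4]
min_temp = [5.3, 5.4, 5.1, 5.2, 5.4, 6.3, 5.6, 4.9, 5.7, 6.4]

# Yearly gap between average maximum and average minimum temperature,
# rounded to one decimal to match the precision of the inputs.
spread = [round(hi - lo, 1) for hi, lo in zip(max_temp, min_temp)]

print(round(iqr, 2))
print(spread)
```

The ten early-year differences all fall between the overall minimum of 7.9 and maximum of 11.1 reported in the summary.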

Plot Two

Description Two

Average temperature has a strong inverse correlation to average wind speed and days of snow per year. These correlations are -.514 and -.708 respectively. This plot shows that the greater the temperature (bucket) the more clustered the points are in the lower left corner of the plot, corresponding to lower average wind speed and fewer days of snow.

Days of snow:

## sw$AvgTemp.bucket: (8.69,10.4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   19.00   24.00   25.47   33.00   42.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (10.4,11]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   14.00   16.00   16.35   23.00   25.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11,11.3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00    8.00   12.00   13.36   19.00   26.00 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11.3,12.9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.5     7.0     9.0    10.5    22.0

Average wind speed:

## sw$AvgTemp.bucket: (8.69,10.4]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   11.80   14.60   15.60   15.91   16.80   19.70 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (10.4,11]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    10.8    11.9    13.2    12.9    13.8    15.2 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11,11.3]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.20   12.37   13.25   13.11   13.88   14.70 
## -------------------------------------------------------- 
## sw$AvgTemp.bucket: (11.3,12.9]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.90   11.45   13.10   13.00   13.80   17.90

Looking at the statistics above, both days of snow and average wind speed are lower in the warmer temperature buckets. Days of snow falls in nearly every summary statistic from each bucket to the next, while for wind speed most of the drop lies between the coldest bucket and the other three. Days of snow is a particularly striking figure: the median drops from 24 in the coldest bucket to 7 in the warmest. That the median days of snow drops by more than a factor of 3 with around a 3 degree change is surprising in its intensity.
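The size of the drop can be checked directly from the bucket medians printed above. This is a small Python sketch, with the medians copied from the two summaries; the helper name `strictly_decreasing` is an assumption made for this example.

```python
# Medians per temperature bucket, coldest to warmest, copied from the output above.
snow_medians = [24.0, 16.0, 12.0, 7.0]
wind_medians = [15.6, 13.2, 13.25, 13.1]


def strictly_decreasing(values):
    """True if every value is smaller than the one before it."""
    return all(a > b for a, b in zip(values, values[1:]))


# Ratio of the coldest-bucket median to the warmest-bucket median.
snow_ratio = snow_medians[0] / snow_medians[-1]
wind_ratio = wind_medians[0] / wind_medians[-1]

print(round(snow_ratio, 2), strictly_decreasing(snow_medians))
print(round(wind_ratio, 2), strictly_decreasing(wind_medians))
```

The snow medians fall at every step, while the wind medians do not: the third bucket's median (13.25) is slightly above the second's (13.2).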

Plot Three

Description Three

A massive dropoff in the days of fog by a factor of five between 1995 and 1997 raises some questions. A search for changes in fog definition and fog reporting standards returned nothing substantial. An examination of the website which provides this data found similar trends nationally:

Kansas City - http://en.tutiempo.net/climate/ws-724463.html

New York - http://en.tutiempo.net/climate/ws-744860.html

Similar trends are seen internationally, but the years of dropoff vary by country and some countries have no dropoff at all:

London - http://en.tutiempo.net/climate/ws-37720.html

Tokyo - http://en.tutiempo.net/climate/ws-476710.html

Rome - http://en.tutiempo.net/climate/ws-162390.html


Reflection

The dataset analyzed contains weather information from SeaTac airport in Seattle, ranging from 1948 to 2015. The first difficulty I ran into was missing values. I decided to remove 3 years from the dataset which were missing values in all variables, but decided to keep 9 years which were missing values in yearly precipitation only. My first pass at analysis, looking at univariate distributions, turned up only 1 bimodal variable, days of fog. This became an extra path of inquiry along with my original intention of looking at the relationships of many of the variables with temperature.

Moving on to bivariate analysis, I found that making a matrix of plots with ggpairs was essential for locating correlations within the data. It guided me to the correlations between temperature and days of snow and between temperature and wind speed. It also showed no correlation between days of fog and any of the other variables, which prompted me to plot fog by year. This plot (final plot 3) instantly clarified the bimodal distribution found in the univariate analysis. It is an example of how an initially promising piece of data can be misleading, and finding an explanation for it was one of the big difficulties in my analysis. In contrast, the plots for temperature against days of snow and wind speed were compelling out of the box and represent the bulk of the success in my analysis.

In the multivariate analysis I had trouble finding 3 variables that had interesting relationships. To that end I created several variables, including precipitation per day and temperature buckets, to further my analysis.

Moving forward, the results raise some questions. Is wind speed similarly affected by temperature in other locations? What other factors that aren't included in this analysis might affect wind speed besides temperature? Similar questions can be asked for days of snow as well.

References

http://en.tutiempo.net/climate/ws-727930.html

https://www.rstudio.com/wp-content/uploads/2015/02/rmarkdown-cheatsheet.pdf

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/diamondsExample.html